Loading the libraries required for this analysis

# install.packages("corrplot")
# install.packages("PerformanceAnalytics")

library(tidyverse)
library(kableExtra)
library(lmtest)
library(MASS)

Introduction to the dataset

The Living Costs and Food Survey (LCF) is an annual survey carried out in the United Kingdom by the Office for National Statistics since 1957. It collects data on spending patterns and the cost of living of households across the UK.



Data Collection Methodology

The LCF sample for Great Britain is a multi-stage stratified random sample with clustering. Addresses with ‘small user’ postcodes are drawn from the Postcode Address File. The LCF sample for Northern Ireland is collected by the Central Survey Unit of the Northern Ireland Statistics and Research Agency (NISRA); there, a systematic random sample of private addresses is drawn from the Land and Property Services Agency property database.

  • The LCF is a continuous survey, collected throughout the year so that seasonal effects are covered.

  • About 11,000 private households are randomly selected each year.

  • Since participation is completely voluntary, households can choose not to respond to the survey.

  • Every year, on average, about 50% of the 11,000 selected households choose to respond.

  • Each participating household fills in a household questionnaire and an individual questionnaire, and every individual aged 16 and over keeps a diary of daily expenditure for two weeks.

Reason for the survey

The LCF provides information for the Retail Prices Index, National Accounts estimates of household expenditure, the analysis of the effect of taxes and benefits and trends in nutrition. The results, however, are multi-purpose, providing an invaluable supply of economic and social data.

LCF Survey 2013

The dataset 1 analysed in this project is a teaching dataset, a simplified subset of the LCF 2013 survey intended for learning and teaching. It has been anonymised and deposited with the UK Data Service and can be found here.

The LCF 2013 survey has 5,144 respondents, of which 151 were from Northern Ireland. Each row in the dataset contains observations from one household.



Purpose of study

Household spending is the amount of final consumption expenditure made by resident households to meet their everyday needs, such as food, clothing, housing (rent), energy, transport, durable goods (notably cars), health costs, leisure, and miscellaneous services. It is typically around 60% of gross domestic product (GDP) and is therefore an essential variable for economic analysis of demand 2.

Economists have traditionally relied on reported household income and expenditure as preferred indicators of poverty and living standards, but the use of such indicators can be problematic. Their measurement usually requires lengthy modules and detailed questions, which are not practical for households with other priorities. The resulting data can also contain errors or reporting biases. 3

As part of this study, we would like to model household expenditure from income and other socioeconomic indicators of the household. The model built will then be tested to determine how well it predicts household expenditure.




Variable definitions

The variables in the original dataset are

Variable name Variable label Variable type
casenew Randomly generated case number Scalar
weighta Annual weight Scalar
P550tpr Total expenditure, by adults & children (top-coded) Scalar
P344pr Gross normal weekly household income (top-coded) Scalar
P425r Main source of household income Nominal
A172 Internet connection in household Nominal
A093r Economic position of Household Reference Person Nominal
A094r NS-SEC 3 class of Household Reference Person Nominal
A121r Tenure type Nominal
SexHRP Sex of Household Reference Person Nominal
A049r Number of persons in household Ordinal
G018r Number of adults in household Ordinal
G019r Number of children in household Ordinal
Gorx Government Office Region - modified Nominal
weightar Weight (rescaled) Scalar
maininc Main source of household income (recoded, P425-1) Nominal
income Income Scalar
expenditure Total expenditure (top coded, formerly P550tpr) Scalar
hhsize Household size, number of people in household (recoded, formerly A049r) Nominal



Loading the dataset in R

Loading data from the tab delimited file:

lcf_data_raw = read.table("icfforworkbook.tab", sep = "\t", header = TRUE)


colnames(lcf_data_raw)[colnames(lcf_data_raw)=="G018r"] = "noAdults"
colnames(lcf_data_raw)[colnames(lcf_data_raw)=="G019r"] = "noChildren"
colnames(lcf_data_raw)[colnames(lcf_data_raw)=="A049r"] = "noPersons"


# remove rows that have zero income
lcf_data_raw = lcf_data_raw[lcf_data_raw$income > 0,]

# Converting to factor variable
# Income Source
# 1 = EarnedIncome
# 2 = OtherIncome
lcf_data_raw$incomeSource = ifelse(lcf_data_raw$P425r==1,"EarnedIncome","OtherIncome")
lcf_data_raw$incomeSource = as.factor(lcf_data_raw$incomeSource)

# Internet connection in household
# 1 - Yes
# 2 - No
lcf_data_raw$internet = ifelse(lcf_data_raw$A172==1, "Yes", "No")
lcf_data_raw$internet = as.factor(lcf_data_raw$internet)

# Economic position of Household Reference Person
# 1 - Full-time working
# 2 - Part-time working
# 3 - Unemployed and work related Government Training Programmes
# 4 - Economically inactive
lcf_data_raw$economicHRP = ifelse(lcf_data_raw$A093r == 1, "FullTime",
                           ifelse(lcf_data_raw$A093r == 2, "PartTime",
                           ifelse(lcf_data_raw$A093r == 3, "Unemployed",
                                  "EconomicallyInactive")))
lcf_data_raw$economicHRP = as.factor(lcf_data_raw$economicHRP)

# Class of Household Reference Person
# 1 - Higher managerial, administrative and professional occupations
# 2 - Intermediate occupations
# 3 - Routine and manual occupations
# 4 - Never worked and long term unemployed, students and occupation not stated
# 5 - Not classified for other reasons
lcf_data_raw$SEC3Class = ifelse(lcf_data_raw$A094r == 1, "Class1",
                         ifelse(lcf_data_raw$A094r == 2, "Class2",
                         ifelse(lcf_data_raw$A094r == 3, "Class3",
                         ifelse(lcf_data_raw$A094r == 4, "Class 4",
                                "Class 5"))))
lcf_data_raw$SEC3Class = as.factor(lcf_data_raw$SEC3Class)

# Tenure Type
# 1 - Public Rented
# 2 - Private Rented
# 3 - Owned
lcf_data_raw$tenureType = ifelse(lcf_data_raw$A121r==1, "PublicRented", ifelse(lcf_data_raw$A121r==2, "PrivateRented", "Owned"))
lcf_data_raw$tenureType = as.factor(lcf_data_raw$tenureType)

# Sex of household reference person
# 1 - Male
# 2 - Female
lcf_data_raw$sex_HRP = ifelse(lcf_data_raw$SexHRP==1, "Male", "Female")
lcf_data_raw$sex_HRP = as.factor(lcf_data_raw$sex_HRP)


# Number of children in the household
#lcf_data_raw$NoChildren = ifelse(lcf_data_raw$G019r==1, "None", ifelse(lcf_data_raw$G019r==2, "OneChild", "TwoOrMore"))
#lcf_data_raw$NoChildren = as.factor(lcf_data_raw$NoChildren)


# select columns
cols = c("noAdults", "noPersons", "noChildren", "income", "expenditure", "hhsize", "incomeSource", "internet", "economicHRP", "SEC3Class", "tenureType", "sex_HRP")
lcf_data = lcf_data_raw[cols]



# expenditure is top coded; the top-coded rows could be removed as below,
# but they are retained for this analysis
#lcf_data = filter(lcf_data,expenditure!=max(lcf_data$expenditure))



Sample data:

head(lcf_data) %>% 
  kable() %>% kable_styling(c("striped","bordered"))
noAdults noPersons noChildren income expenditure hhsize incomeSource internet economicHRP SEC3Class tenureType sex_HRP
2 4 3 465.36 380.6958 4 EarnedIncome Yes EconomicallyInactive Class3 PublicRented Female
2 2 1 855.26 546.4134 2 EarnedIncome Yes FullTime Class 4 Owned Female
1 1 1 160.96 242.1890 1 EarnedIncome Yes FullTime Class2 Owned Female
2 2 1 656.22 421.3824 2 EarnedIncome Yes FullTime Class3 Owned Male
1 1 1 398.80 370.4056 1 EarnedIncome Yes FullTime Class 4 Owned Male
1 1 1 321.02 172.3972 1 EarnedIncome No FullTime Class3 PublicRented Male


The structure of the dataframe is:

str(lcf_data)
## 'data.frame':    5132 obs. of  12 variables:
##  $ noAdults    : int  2 2 1 2 1 1 2 4 2 2 ...
##  $ noPersons   : int  4 2 1 2 1 1 2 5 2 2 ...
##  $ noChildren  : int  3 1 1 1 1 1 1 1 1 1 ...
##  $ income      : num  465 855 161 656 399 ...
##  $ expenditure : num  381 546 242 421 370 ...
##  $ hhsize      : int  4 2 1 2 1 1 2 5 2 2 ...
##  $ incomeSource: Factor w/ 2 levels "EarnedIncome",..: 1 1 1 1 1 1 2 1 1 2 ...
##  $ internet    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 2 2 ...
##  $ economicHRP : Factor w/ 4 levels "EconomicallyInactive",..: 1 2 2 2 2 2 1 1 2 1 ...
##  $ SEC3Class   : Factor w/ 5 levels "Class 4","Class 5",..: 5 1 4 5 1 5 2 2 4 2 ...
##  $ tenureType  : Factor w/ 3 levels "Owned","PrivateRented",..: 3 1 1 1 1 3 1 3 1 1 ...
##  $ sex_HRP     : Factor w/ 2 levels "Female","Male": 1 1 1 2 2 2 2 1 2 2 ...

Method

The following functions will be used to evaluate the models.

get_bp_decision = function(model, alpha = 0.01) { 
  decide = unname(bptest(model)$p.value < alpha) 
  ifelse(decide, "Reject", "Fail to Reject")
}

get_sw_decision = function(model, alpha = 0.01) {
 decide = unname(shapiro.test(resid(model))$p.value < alpha) 
 ifelse(decide, "Reject", "Fail to Reject")
}

get_loocv_rmse = function(model) { 
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
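The function above uses the standard leave-one-out shortcut for linear models: instead of refitting the model \(n\) times, the LOOCV residuals are obtained from the ordinary residuals \(e_i\) and the leverages \(h_i\),

\[ \text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_i}{1-h_i}\right)^2} \]

which is exactly what resid(model) / (1 - hatvalues(model)) computes.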

get_rmse = function(model) {
  # RMSE requires the square root of the mean squared residuals
  sqrt(mean(model$residuals ^ 2))
}

get_adj_r2 = function(model) { 
  summary(model)$adj.r.squared
}

# This function takes a model as input and produces diagnostic
# plots to visually check:
# a) Normality
# b) Linearity
# c) Constant variance

assumption_test = function(model){
  par(mfrow=c(1,2))
  
  # Fitted vs residual plot
  plot(fitted.values(model),resid(model),col="dodgerblue", xlab = "Fitted", ylab = "Residuals", main = "Fitted vs Residuals")
  abline(h=0,col="red",lwd=2)
  
  # QQ plot to test normality
  qqnorm(resid(model),col="dodgerblue")
  qqline(resid(model),lwd=2,col="red")
}

Exploratory Data Analysis

First we examine the correlations between the numeric variables; this will help us remove highly correlated variables.

Figure 1 - Correlation and Scatterplot of the variables of the data set

The correlations in Figure 1 show a strong correlation between hhsize and noPersons; the documentation confirms that they capture the same data:

hhsize = noPersons


noPersons = noChildren + noAdults

Since hhsize carries the same information, we will remove noPersons, noAdults and noChildren from the dataset.

# Deleting the columns
lcf_data$noPersons = NULL
lcf_data$noAdults = NULL
lcf_data$noChildren = NULL


Let’s draw the correlation plot one more time:

PerformanceAnalytics::chart.Correlation(select_if(lcf_data,is.numeric), histogram = TRUE, pch = 19)

Model Building

Simple additive model with all variables

Let’s create an additive model first:

model_add_all = lm(expenditure ~ ., data = lcf_data)
summary(model_add_all)
## 
## Call:
## lm(formula = expenditure ~ ., data = lcf_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -570.77 -120.27  -29.57   87.19  944.18 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              30.10917   22.34003   1.348  0.17779    
## income                    0.45755    0.01196  38.256  < 2e-16 ***
## hhsize                   40.28775    2.76118  14.591  < 2e-16 ***
## incomeSourceOtherIncome  18.52177   10.21878   1.813  0.06996 .  
## internetYes              66.91895    8.30417   8.058 9.54e-16 ***
## economicHRPFullTime     -26.02638   14.11971  -1.843  0.06535 .  
## economicHRPPartTime       5.80551   14.47181   0.401  0.68832    
## economicHRPUnemployed   -34.02913   19.15582  -1.776  0.07572 .  
## SEC3ClassClass 5         -6.79073   17.97521  -0.378  0.70561    
## SEC3ClassClass1          51.29857   17.79294   2.883  0.00395 ** 
## SEC3ClassClass2          28.83109   18.31898   1.574  0.11559    
## SEC3ClassClass3         -21.27501   17.45756  -1.219  0.22303    
## tenureTypePrivateRented  22.02845    8.22774   2.677  0.00744 ** 
## tenureTypePublicRented  -44.78014    8.35520  -5.360 8.71e-08 ***
## sex_HRPMale              19.49748    6.09078   3.201  0.00138 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 197.9 on 5117 degrees of freedom
## Multiple R-squared:  0.5431, Adjusted R-squared:  0.5419 
## F-statistic: 434.5 on 14 and 5117 DF,  p-value: < 2.2e-16


Selecting only the important variables from the above model. Having a smaller model will help interpret the results.

model_add = lm(expenditure ~ income + hhsize + internet, data = lcf_data)
summary(model_add)
## 
## Call:
## lm(formula = expenditure ~ income + hhsize + internet, data = lcf_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -575.95 -123.20  -31.76   83.95  941.11 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.021354   7.722977   3.887 0.000103 ***
## income       0.491726   0.009211  53.386  < 2e-16 ***
## hhsize      35.427781   2.667127  13.283  < 2e-16 ***
## internetYes 74.925596   8.150891   9.192  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 200.6 on 5128 degrees of freedom
## Multiple R-squared:  0.5299, Adjusted R-squared:  0.5296 
## F-statistic:  1927 on 3 and 5128 DF,  p-value: < 2.2e-16

With just the important variables, \(R^2\) did not go down much. This simpler additive model will help us understand how expenditure is explained by the other factors.



assumption_test(model_add)
Figure 3 - Residuals vs Fitted and Normal QQ Plot for the model


To draw inferences from the model, the assumptions of the linear model should be met.

  • L - the response is a Linear combination of the predictors

  • I - the errors are Independent

  • N - the errors follow a Normal distribution

  • E - the error variance is Equal (constant) at any set of predictor values
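As a minimal sketch (on simulated data, not the LCF dataset), the N and E assumptions have formal counterparts in the Shapiro-Wilk and Breusch-Pagan tests used by the helper functions defined earlier:

```r
library(lmtest)

# Simulated data whose errors satisfy the assumptions by construction
set.seed(42)
x = runif(200)
y = 2 + 3 * x + rnorm(200)
fit = lm(y ~ x)

# E: Breusch-Pagan tests for Equal (constant) error variance
bp_p = unname(bptest(fit)$p.value)
# N: Shapiro-Wilk tests Normality of the residuals
sw_p = shapiro.test(resid(fit))$p.value

# Large p-values mean we fail to reject either assumption
c(BP = bp_p, SW = sw_p)
```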


From the Fitted vs Residuals plot, constant variance is suspect; from the QQ plot, normality is also suspect.

We need to transform the response so that these assumptions are not violated.

Log Transformation


First, let’s try a log transformation of the response:

# Log transformation of response variable
model_log = lm(log(expenditure)~ income + hhsize + internet, data = lcf_data)


summary(model_log)
## 
## Call:
## lm(formula = log(expenditure) ~ income + hhsize + internet, data = lcf_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61228 -0.27563  0.00905  0.27172  1.62147 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.786e+00  1.698e-02  281.84   <2e-16 ***
## income      1.096e-03  2.025e-05   54.11   <2e-16 ***
## hhsize      9.043e-02  5.864e-03   15.42   <2e-16 ***
## internetYes 3.517e-01  1.792e-02   19.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.441 on 5128 degrees of freedom
## Multiple R-squared:  0.5858, Adjusted R-squared:  0.5855 
## F-statistic:  2417 on 3 and 5128 DF,  p-value: < 2.2e-16
# Use the assumption_test function defined earlier to check
# whether the linear model assumptions are violated
assumption_test(model_log)
Figure 4 - Residuals vs Fitted and Normal QQ Plot for the log transformed model


From the above plots:

  • Normality is NOT suspect, as the errors follow the theoretical quantiles.
  • Constant variance of the errors is suspect, as the errors are not evenly distributed across the fitted values.



BoxCox Transformation

We will now try the Box-Cox transformation.

The Box-Cox method considers a family of transformations on strictly positive response variables, \[ g_\lambda(y) = \left\{ \begin{array}{lr}\displaystyle\frac{y^\lambda - 1}{\lambda} & \lambda \neq 0\\ & \\ \log(y) & \lambda = 0 \end{array} \right. \]

The λ parameter is chosen by numerically maximizing the log-likelihood,

model =lm(expenditure ~ income + hhsize + internet, data = lcf_data)
bc=boxcox(model,lambda = seq(-3,3))
Figure 5 - BoxCox for the simple model

(lambda = bc$x[which.max(bc$y)])
## [1] 0.2727273
model_bc = lm((expenditure^lambda) ~ ( income + hhsize + internet), data = lcf_data)
summary(model_bc)
## 
## Call:
## lm(formula = (expenditure^lambda) ~ (income + hhsize + internet), 
##     data = lcf_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.05082 -0.39991 -0.02937  0.37376  2.43820 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.605e+00  2.288e-02  157.58   <2e-16 ***
## income      1.533e-03  2.728e-05   56.19   <2e-16 ***
## hhsize      1.216e-01  7.900e-03   15.39   <2e-16 ***
## internetYes 4.080e-01  2.414e-02   16.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5941 on 5128 degrees of freedom
## Multiple R-squared:  0.5883, Adjusted R-squared:  0.588 
## F-statistic:  2442 on 3 and 5128 DF,  p-value: < 2.2e-16


# Use the assumption_test function defined earlier to check
# whether the linear model assumptions are violated
assumption_test(model_bc)
Figure 6 - Residuals vs Fitted and Normal QQ Plot for the BoxCox transformed model


Even though the variance is slightly inconsistent, it generally looks fine:

  • Normality is NOT suspect.
  • Constant variance is NOT suspect.




Based on the above model, let’s find the train RMSE and the test RMSE. The test RMSE tells us how well the model performs on unseen data.
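Here both RMSE values are computed on the original expenditure scale: since the model fits \(y^\lambda\), each prediction is raised to the power \(1/\lambda\) before being compared with the observed expenditure,

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad \hat{y}_i = \left(\widehat{y_i^{\lambda}}\right)^{1/\lambda} \]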

# training and testing datasets
set.seed(420)

# Split the lcf dataset into train (80%) and test (20%) sets
lcf_trn_idx = sample(nrow(lcf_data), size = trunc(0.80 * nrow(lcf_data)))
lcf_trn_data = lcf_data[lcf_trn_idx, ]
lcf_tst_data = lcf_data[-lcf_trn_idx, ]

# Building the model using only the training data
model_train = lm(expenditure^lambda ~ income + hhsize + internet, data = lcf_trn_data)


# Train RMSE - prediction on the training data using the model built on it
(train_rmse = sqrt(mean((predict(model_train)^(1/lambda) - lcf_trn_data$expenditure)^2)))
## [1] 204.5318
# Test RMSE - prediction on the test dataset, which was not used for fitting.
# This is a good assessment of how the model will perform on unseen data
(test_rmse =sqrt(mean((predict(model_train,newdata = lcf_tst_data)^(1/lambda) - lcf_tst_data$expenditure)^2)))
## [1] 195.5008




Interaction Model Explanation

Start with a large model,

large_model = lm(expenditure ~ I(income ^ 2) + I(hhsize ^ 2) + (income + hhsize + internet + economicHRP + SEC3Class + tenureType + sex_HRP)^2, data = lcf_trn_data)

Performing backward AIC on the model,

large_model_back_aic = step(large_model, direction = "backward", trace = 0)

and backward BIC on the model,

large_model_back_bic = step(large_model, direction = "backward", trace = 0, k = log(length(resid(large_model))))
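The only difference between the two searches is the penalty multiplier k in step(): the default k = 2 gives AIC, while k = log(n) gives BIC, which penalizes the number of parameters \(p\) more heavily for large \(n\). Up to an additive constant,

\[ \text{AIC} = n\log\left(\frac{\text{RSS}}{n}\right) + 2p, \qquad \text{BIC} = n\log\left(\frac{\text{RSS}}{n}\right) + \log(n)\,p \]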

The characteristics of the models are:

Characteristic Large Model Model from backward AIC Model from backward BIC
Number of predictors 84 37 12
Adjusted \(R^2\) 0.5421597 0.5431252 0.5346943
LOOCV RMSE 199.0421315 198.0560146 199.3892669
Shapiro-Wilk Test Reject Reject Reject
BP Test Reject Reject Reject

The model obtained from the backward AIC search has the lowest LOOCV RMSE and the highest adjusted \(R^2\), but it did not pass the Shapiro-Wilk and BP tests. The Fitted vs Residuals plot and the Normal QQ plot (Figure 7) were then examined for this model.

assumption_test(large_model_back_aic)
Figure 7 - Residuals vs Fitted and Normal QQ Plot for the model obtained from backward AIC search

Figure 7 shows that the constant variance assumption is violated, since the residuals are not evenly spread about the zero line. The QQ plot also shows that the normality assumption is violated.

These results indicate that a transformation of the response is required. Performing a BoxCox transformation on the model,

bc = boxcox(large_model_back_aic, plotit = TRUE, lambda = seq(0, 0.3, by = 0.1))
Figure 8

inter_lambda = bc$x[which.max(bc$y)]

indicates that the value of \(\lambda\) in \(\frac{{y^\lambda}-1}{\lambda}\) is 0.1939394. Applying this transformation to the large model,

large_model_trans = lm((((expenditure ^ inter_lambda)-1)/inter_lambda) ~ I(income ^ 2) + I(hhsize ^ 2) + (income + hhsize + internet + economicHRP + SEC3Class + tenureType + sex_HRP)^2, data = lcf_trn_data)

and performing a backward search using AIC,

large_model_trans_back_aic = step(large_model_trans, direction = "backward", trace = 0)

and a backward search using BIC,

large_model_trans_back_bic = step(large_model_trans, direction = "backward", trace = 0, k = log(length(resid(large_model_trans))))

yields the following results

Characteristic Large Model Model from backward AIC Model from backward BIC
Number of predictors 84 40 19
Adjusted \(R^2\) 0.6177774 0.618087 0.6127457
LOOCV RMSE 1.3266462 1.3187038 1.3244992
Shapiro-Wilk Test Reject Reject Reject
BP Test Reject Reject Reject

The adjusted \(R^2\) values have increased relative to the untransformed models. (Note that the LOOCV RMSE values are now on the Box-Cox transformed scale, so they are not directly comparable with the earlier ones.) However, the Shapiro-Wilk and BP tests are still not passed by these models.

The model chosen by the backward BIC search is smaller, with LOOCV RMSE and adjusted \(R^2\) values similar to those of the backward AIC model, and so it will be used. The Fitted vs Residuals plot and the Normal QQ plot (Figure 9) were then examined for this model.

assumption_test(large_model_trans_back_bic)
Figure 9 - Residuals vs Fitted and Normal QQ Plot for the model obtained from backward BIC search

The plots show that the model assumptions are not being violated. This model was then tested to see how well it predicts expenditure on the test dataset.
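Because the model predicts on the Box-Cox scale \(g_\lambda(y) = (y^{\lambda}-1)/\lambda\), each prediction must be mapped back to the expenditure scale by inverting the transformation,

\[ y = \left(\lambda\, g_\lambda(y) + 1\right)^{1/\lambda} \]

which is what the first line of the code below computes.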

large_model_trans_back_bic_prediction = ((inter_lambda * predict(large_model_trans_back_bic, newdata = lcf_tst_data))+1)^(1/inter_lambda)
large_model_trans_back_bic_prediction_diff = (lcf_tst_data[,"expenditure"]-large_model_trans_back_bic_prediction)/lcf_tst_data[,"expenditure"]

The plot of the relative difference between the expenditure values predicted by the model and the actual values is shown in Figure 10.

Figure 10 - difference between predicted and actual values

The plot shows that the predicted values are very close to the actual values and so the model is a good candidate to explain the relationship among these variables.

Comparing the models

The RMSE values for the additive model and the interaction model are:

Model RMSE on training data RMSE on test data
Additive 204.5318118 195.5007985
Interaction 552.7602248 195.094629



Results



References

1 University of Manchester, Cathie Marsh Institute for Social Research (CMIST), UK Data Service, Office for National Statistics. (2019). Living Costs and Food Survey, 2013: Unrestricted Access Teaching Dataset. [data collection]. 2nd Edition. Office for National Statistics, [original data producer(s)]. Office for National Statistics. SN: 7932, http://doi.org/10.5255/UKDA-SN-7932-2

2 Organisation for Economic Co-operation and Development. “Household Accounts - Household Spending”. oecd.org. https://data.oecd.org/hha/household-spending.htm (Accessed July 21st 2019).

3 Ferguson, B. D., Tandon, A., Gakidou, E., & Murray, C. J. L. (2010). Estimating Permanent Income Using Indicator Variables. In Health Systems Performance Assessment: Debates, Methods and Empiricism.